Prosper is America’s first marketplace peer-to-peer lending platform, with over $12 billion in funded loans. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit score, etc. The explaination of all the variables in the data can be found here
In this project, an Exploratory Data Analysis (EDA) for the Prosper loan data set was conducted, in order to dig out some inspiring and valuable insights of the loan. Then Machine learning algorithms was applied to predict the default risk.
- Introduction 1.1 Background 1.2 Install R packages 1.3 Data loanding and exploring
- Data Cleaning 2.1 Missing values 2.2 Drop columns and rows 2.3 Filling null values with column mean
- Exploratory Data Analysis 3.1 Univariate Plots Section 3.2 Bivariate Plots Section 3.3 Multivariate Plots Section
- Feature Engineering 4.1 Data pre-processing 4.2 Feature selection 4.3 Model validation and selection 4.4 Default prediction
- Conclusion and summary
In this analysis, I used 10 packages for data manipulation, data visulization and correlation analysis. The name are below: ggplot2, date, dplyr, gridExtra, scales, choroplethrMaps, chroplethr, GGally, knitr.
## There are 113937 observations of 81 variables.
1.The Prosper loan data contains 113,937 loans that were issued through 2005-2014, with 81 variables for each loan. 2. The target is the loan status which could help investors estimate the default risk of borrowers. 3. The interesting features that might affect the loan status could from numerical variables, including borrower rate, monthly income,loan amount, credit score, term of the loan, total prospers loans. The fetures could also from and categorical variables, including employment status, loan purpose, home ownership, borrower state.
## There are 0 dupplicates
## ListingKey ListingNumber
## 0 0
## ListingCreationDate CreditGrade
## 0 0
## Term LoanStatus
## 0 0
## ClosedDate BorrowerAPR
## 0 25
## BorrowerRate LenderYield
## 0 0
## EstimatedEffectiveYield EstimatedLoss
## 29084 29084
## EstimatedReturn ProsperRating..numeric.
## 29084 29084
## ProsperRating..Alpha. ProsperScore
## 0 29084
## ListingCategory..numeric. BorrowerState
## 0 0
## Occupation EmploymentStatus
## 0 0
## EmploymentStatusDuration IsBorrowerHomeowner
## 7625 0
## CurrentlyInGroup GroupKey
## 0 0
## DateCreditPulled CreditScoreRangeLower
## 0 591
## CreditScoreRangeUpper FirstRecordedCreditLine
## 591 0
## CurrentCreditLines OpenCreditLines
## 7604 7604
## TotalCreditLinespast7years OpenRevolvingAccounts
## 697 0
## OpenRevolvingMonthlyPayment InquiriesLast6Months
## 0 697
## TotalInquiries CurrentDelinquencies
## 1159 697
## AmountDelinquent DelinquenciesLast7Years
## 7622 990
## PublicRecordsLast10Years PublicRecordsLast12Months
## 697 7604
## RevolvingCreditBalance BankcardUtilization
## 7604 7604
## AvailableBankcardCredit TotalTrades
## 7544 7544
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## 7544 7544
## DebtToIncomeRatio IncomeRange
## 8554 0
## IncomeVerifiable StatedMonthlyIncome
## 0 0
## LoanKey TotalProsperLoans
## 0 91852
## TotalProsperPaymentsBilled OnTimeProsperPayments
## 91852 91852
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 91852 91852
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 91852 91852
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 95009 0
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination
## 96985 0
## LoanNumber LoanOriginalAmount
## 0 0
## LoanOriginationDate LoanOriginationQuarter
## 0 0
## MemberKey MonthlyLoanPayment
## 0 0
## LP_CustomerPayments LP_CustomerPrincipalPayments
## 0 0
## LP_InterestandFees LP_ServiceFees
## 0 0
## LP_CollectionFees LP_GrossPrincipalLoss
## 0 0
## LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments
## 0 0
## PercentFunded Recommendations
## 0 0
## InvestmentFromFriendsCount InvestmentFromFriendsAmount
## 0 0
## Investors
## 0
## ListingCreationDate Term
## 0 0
## LoanStatus BorrowerRate
## 0 0
## ListingCategory..numeric. BorrowerState
## 0 0
## EmploymentStatus EmploymentStatusDuration
## 0 0
## IsBorrowerHomeowner CreditScoreRangeLower
## 0 0
## CreditScoreRangeUpper OpenCreditLines
## 0 0
## OpenRevolvingAccounts OpenRevolvingMonthlyPayment
## 0 0
## InquiriesLast6Months TotalInquiries
## 0 0
## CurrentDelinquencies AmountDelinquent
## 0 0
## DelinquenciesLast7Years PublicRecordsLast10Years
## 0 0
## PublicRecordsLast12Months RevolvingCreditBalance
## 0 0
## BankcardUtilization AvailableBankcardCredit
## 0 0
## TotalTrades TradesNeverDelinquent..percentage.
## 0 0
## TradesOpenedLast6Months DebtToIncomeRatio
## 0 8404
## StatedMonthlyIncome LoanOriginalAmount
## 0 0
## MonthlyLoanPayment Recommendations
## 0 0
## Investors
## 0
## Term LoanStatus
## 0 0
## BorrowerRate ListingCategory
## 0 0
## BorrowerState EmploymentStatus
## 0 0
## EmploymentStatusDuration IsBorrowerHomeowner
## 0 0
## CreditScoreRangeLower CreditScoreRangeUpper
## 0 0
## OpenCreditLines OpenRevolvingAccounts
## 0 0
## OpenRevolvingMonthlyPayment InquiriesLast6Months
## 0 0
## TotalInquiries CurrentDelinquencies
## 0 0
## AmountDelinquent DelinquenciesLast7Years
## 0 0
## PublicRecordsLast10Years PublicRecordsLast12Months
## 0 0
## RevolvingCreditBalance BankcardUtilization
## 0 0
## AvailableBankcardCredit TotalTrades
## 0 0
## TradesNeverDelinquent TradesOpenedLast6Months
## 0 0
## DebtToIncomeRatio StatedMonthlyIncome
## 0 0
## LoanOriginalAmount MonthlyLoanPayment
## 0 0
## Recommendations Investors
## 0 0
## LoanCreationDate
## 0
## There are 106290 observations of 33 variables.
In this section, I want to start exploring the data by visulization The fetures that I am interested to investigate are below:
In risk management one important job is to build a predictive model to predict whether the loan will be default or not. So the loan status wil be the target variable. However, there are 12 loan status. But we need to label them as default or non defaul loans. ‘delinquent’, ‘default’ and ‘charge off’ are terms helping investor to estimate whether or not they can expect to collect on the outstanding debt at all. According to the definition of delinquencies from badcredit.org, an account will be considered severely delinquent if no payment has been received 60 days beyond the due date.Those delinquent accounts may have higher risk of default.
So I created a new variable ‘DelinquentBorrowers’ using ‘0’ to represent none delinquent borrowers and ‘1’ to represent delinquent borroweres, including loans that are noted default, charged off, and past due over 60 days. According to LendingClub, In general, a note goes into default status when it is 121 or more days past due. When a note is in default status, Charge Off occurs no later than 150 days past due (i.e. No later than 30 days after the Default status is reached) when there is no reasonable expectation of sufficient payment to prevent the charge off.
## # A tibble: 2 x 3
## DelinquentBorrowers counts Percentage
## <dbl> <int> <dbl>
## 1 0 91736 86.3
## 2 1 14554 13.7
The loan categories are give as numerical variables, in order to investigate the correlation of loan purpose with other features, the ‘ListingCategory’ column was coverted into categorical variable.
-The monthly income is related to employment status, I cut the montly income into 6 levels and created a new variable MonthlyIncome.bucket. Interestingly, not employed does not mean low monthly income, employed and self-employed people also have very high monthly income.
- The prospser loans dataset contains over 100k observations with 81 variables spanning across 10 years.The first step before conducting any data analysis is to understand the variables, terminology and general domain knowledge of financial peer-to-peer lending.Second, it is very important to determine which variables to analyze and stick to those variables without drifting too far off.Also there are so many missing values and none specific observations, cleaning is needed for looking into the relationship between some variables.
- For loan data, I believe that default risk analysis is a key component to help investors to decide if they could collect the full loan.So any features that could lead to default are worth investigating, In my analysis, I found that borrower rate from investor’s aspect and Prosper score from borrower’s aspect are strongly relatted to deliquencies. This brought me to investigate more features that could possibly affect borrower rate and prosper score, including monthly income, borrower region, loan purpose, employment status, credit card utilization, available credit, loan amount,home ownership, and debt to income ratio.
The Exploratory Data Analysis strategy is a good way to find some insights of the date through interesting visualizations. However, EDA works better for data set with limited variables. For the Prosper Loan data with 81 variables, it is really time consuming to find out the most intesting features and investigate the correlations between them. In the future,an useful method to investigate this data could be building predictive models using machine learning.
- The process of analyzing Prosper loan data is a good experience in data cleanning, visulizaion, and feature engineering.
- The advantages of random forest modeling include telling people about the importance of each variables.
- Some other features not included it the dataset could be also very important, such as the borrower’s age, and gender.
- This project could go further by carefully selecting the features. Future work could be predicting the loss and profit, predicting the interest rate to minimize the default, etc.